Human Document Classification Using Bags of Words

Authors

  • Florian Wolf
  • Tomaso Poggio
  • Pawan Sinha
Abstract

Humans are remarkably adept at classifying text documents into categories. For instance, while reading a news story, we are rapidly able to assess whether it belongs to the domain of finance, politics or sports. Automating this task would have applications for content-based search or filtering of digital documents. To this end, it is interesting to investigate the nature of information humans use to classify documents. Here we report experimental results suggesting that this information might, in fact, be quite simple. Using a paradigm of progressive revealing, we determined classification performance as a function of the number of words. We found that subjects are able to achieve similar classification accuracy with or without syntactic information across a range of passage sizes. These results have implications for models of human text understanding and also allow us to estimate what level of performance we can expect, in principle, from a system without requiring a prior step of complex natural language processing.

Introspection suggests that for text understanding, humans use a form of representation that takes into account structural and layout information in addition to word-level information. This is particularly likely in a typical document classification task scenario where only limited time is available to choose, from a set of documents, those that are relevant to a certain query or interest. In such a scenario, it seems plausible that syntax and layout information, such as headlines or paragraph boundaries, may be important for performing the classification task. However, there is little systematic experimental work that directly tests this expectation: how would classification performance suffer if syntax and layout information were removed from text passages?

Following standard practice in the field (Mitchell 1997), we refer to a syntax- and layout-free representation as a 'bag of words' (BOW). A BOW is a feature vector where each element in the vector indicates the presence (or absence) of a word. Our goal was to test whether human classification performance is compromised with a BOW-based representation relative to normally structured text and whether increased time pressure on the task increases the need for a structured representation. Furthermore, we wanted to assess how classification performance with the two representations changes as a function of the number of words included in the passages. In order to probe these questions, we experimentally compared human document classification performance on fully structured documents with performance on a BOW representation, with and without time constraints …
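
The BOW definition above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the tokenizer, the function names (tokenize, build_vocabulary, bow_vector) and the two example sentences are hypothetical, and shuffling word order is only an assumed stand-in for removing syntax, since the abstract does not say how the BOW passages were constructed. The sketch shows that a presence/absence vector is unchanged when word order is destroyed.

```python
import random
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simple, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct tokens found in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bow_vector(text, vocabulary):
    """Binary bag of words: 1 if the vocabulary word occurs in text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical example passages, not taken from the study.
structured = "Shares fell sharply after the central bank raised interest rates"
other = "The striker scored twice in the final"

# Assumed stand-in for a syntax-free passage: the same words in shuffled order.
shuffled_words = structured.split()
random.Random(0).shuffle(shuffled_words)
scrambled = " ".join(shuffled_words)

vocabulary = build_vocabulary([structured, other])
structured_vec = bow_vector(structured, vocabulary)
scrambled_vec = bow_vector(scrambled, vocabulary)
```

Because the vector records only which words occur, structured_vec and scrambled_vec are identical; any classifier that sees only this representation has no access to syntax or layout.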

Similar resources

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...

A Method for Keyword Extraction and Word Weighting to Improve Persian Text Classification

Due to the ever-increasing expansion of information and the huge amount of existing unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, their automated extraction seems inevitable. This research attempts to use a thesaurus (a structured word-net) to extract them automatically. A...

Text Categorization Using Predicate-Argument Structures

Most text categorization methods use the vector space model in combination with a representation of documents based on bags of words. As its name indicates, bags of words ignore possible structures in the text and only take into account isolated, unrelated words. Although this limitation is widely acknowledged, most previous attempts to extend the bag-of-words model with more advanced approac...

Using NLP techniques for file fragment classification

The classification of file fragments is an important problem in digital forensics. The literature does not include comprehensive work on applying machine learning techniques to this problem. In this work, we explore the use of techniques from natural language processing to classify file fragments. We take a supervised learning approach, based on the use of support vector machines combined with ...

Multidimensional counting grids: Inferring word order from disordered bags of words

Models of bags of words typically assume topic mixing so that the words in a single bag come from a limited number of topics. We show here that many sets of bags of words exhibit a very different pattern of variation than the patterns that are efficiently captured by topic mixing. In many cases, from one bag of words to the next, the words disappear and new ones appear as if the theme slowly and...

Publication date: 2006